
Artificial Intelligence for Business Decisions and Transformation  

CSCN8030 - Spring 2024 - Section 2  

Professor: Glaucia Melo dos Santos, PhD  

Group 4 - Members:

Krishna Kumar, Hemasree
Shijin, Jency
Fernandez, Arcadio

Sprint 1 Development:

Project: US Traffic Accident Severity Prediction

US Accidents (2016 - 2023)
A Countrywide Traffic Accident Dataset (2016 - 2023)

1. Introduction: A brief introduction to the problem, libraries and obtaining the data.

1.1. A brief introduction to the problem

1.1.1. Why is this topic important?
There were 39,508 fatal motor vehicle crashes in the United States in 2021 in which 42,939 deaths occurred. This resulted in 12.9 deaths per 100,000 people and 1.37 deaths per 100 million miles traveled. The fatality rate per 100,000 people ranged from 5.7 in Rhode Island to 26.2 in Mississippi. The death rate per 100 million miles traveled ranged from 0.71 in Massachusetts to 2.08 in South Carolina.

1.1.2. How could this project make an impact in society?
Accident severity modeling helps identify contributing factors and develop preventive strategies. AI models, such as random forests, offer adaptability and higher predictive accuracy than traditional statistical models. This study aims to develop a machine-learning model that predicts traffic accident severity on US highways.

1.1.3. Data description
This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records. For more information, see the dataset page on Kaggle.
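For readers reproducing this locally: the full CSV is several gigabytes on disk, so it can be read in chunks rather than in one pass. A minimal sketch, using a hypothetical miniature CSV in place of US_Accidents_March23.csv:

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for US_Accidents_March23.csv
csv_text = "ID,Severity,City\nA-1,2,Miami\nA-2,3,Houston\nA-3,2,Miami\n"

# chunksize yields the file in pieces, so the full ~7.7M-row file
# never has to fit in memory all at once
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)  # 3
```

The same pattern works on the real file by passing its path instead of the `StringIO` buffer.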

1.1.4. Indication of Reference Code
On Kaggle, the US Accidents (2016 - 2023) dataset page lists 371 public notebooks. Two of them caught our attention because of their organization and the way their results are presented:

https://www.kaggle.com/code/jingzongwang/usa-car-accidents-severity-prediction
https://www.kaggle.com/code/satyabrataroy/60-insights-extraction-us-accident-analysis

1.2. Importing all the essential libraries and submodules

In [5]:
%pip install geopandas geopy
Requirement already satisfied: geopandas in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (0.14.4)
Requirement already satisfied: geopy in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (2.4.1)
Note: you may need to restart the kernel to use updated packages.
In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.patches as mpatches
import geopandas as gpd
import fiona
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import csr_matrix

1.3. Obtaining the Data

In [7]:
df_accidents_dataset = pd.read_csv(r"C:\Users\arcad\Labs_term2\US_Accidents_March23.csv")

display(df_accidents_dataset)
ID Source Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) ... Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-1 Source2 3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.865147 -84.058723 NaN NaN 0.010 ... False False False False False False Night Night Night Night
1 A-2 Source2 2 2016-02-08 06:07:59 2016-02-08 06:37:59 39.928059 -82.831184 NaN NaN 0.010 ... False False False False False False Night Night Night Day
2 A-3 Source2 2 2016-02-08 06:49:27 2016-02-08 07:19:27 39.063148 -84.032608 NaN NaN 0.010 ... False False False False True False Night Night Day Day
3 A-4 Source2 3 2016-02-08 07:23:34 2016-02-08 07:53:34 39.747753 -84.205582 NaN NaN 0.010 ... False False False False False False Night Day Day Day
4 A-5 Source2 2 2016-02-08 07:39:07 2016-02-08 08:09:07 39.627781 -84.188354 NaN NaN 0.010 ... False False False False True False Day Day Day Day
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7728389 A-7777757 Source1 2 2019-08-23 18:03:25 2019-08-23 18:32:01 34.002480 -117.379360 33.99888 -117.37094 0.543 ... False False False False False False Day Day Day Day
7728390 A-7777758 Source1 2 2019-08-23 19:11:30 2019-08-23 19:38:23 32.766960 -117.148060 32.76555 -117.15363 0.338 ... False False False False False False Day Day Day Day
7728391 A-7777759 Source1 2 2019-08-23 19:00:21 2019-08-23 19:28:49 33.775450 -117.847790 33.77740 -117.85727 0.561 ... False False False False False False Day Day Day Day
7728392 A-7777760 Source1 2 2019-08-23 19:00:21 2019-08-23 19:29:42 33.992460 -118.403020 33.98311 -118.39565 0.772 ... False False False False False False Day Day Day Day
7728393 A-7777761 Source1 2 2019-08-23 18:52:06 2019-08-23 19:21:31 34.133930 -117.230920 34.13736 -117.23934 0.537 ... False False False False False False Day Day Day Day

7728394 rows × 46 columns

2. Exploratory data analysis - EDA

2.1. Basic data analysis

In [8]:
# Number of rows and columns
print('The Dataset Contains, Rows: {:,d} & Columns: {}'.format(df_accidents_dataset.shape[0], df_accidents_dataset.shape[1]))
The Dataset Contains, Rows: 7,728,394 & Columns: 46
In [9]:
# `info()` method prints information about the DataFrame. 
df_accidents_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7728394 entries, 0 to 7728393
Data columns (total 46 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   ID                     object 
 1   Source                 object 
 2   Severity               int64  
 3   Start_Time             object 
 4   End_Time               object 
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object 
 11  Street                 object 
 12  City                   object 
 13  County                 object 
 14  State                  object 
 15  Zipcode                object 
 16  Country                object 
 17  Timezone               object 
 18  Airport_Code           object 
 19  Weather_Timestamp      object 
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)            float64
 23  Pressure(in)           float64
 24  Visibility(mi)         float64
 25  Wind_Direction         object 
 26  Wind_Speed(mph)        float64
 27  Precipitation(in)      float64
 28  Weather_Condition      object 
 29  Amenity                bool   
 30  Bump                   bool   
 31  Crossing               bool   
 32  Give_Way               bool   
 33  Junction               bool   
 34  No_Exit                bool   
 35  Railway                bool   
 36  Roundabout             bool   
 37  Station                bool   
 38  Stop                   bool   
 39  Traffic_Calming        bool   
 40  Traffic_Signal         bool   
 41  Turning_Loop           bool   
 42  Sunrise_Sunset         object 
 43  Civil_Twilight         object 
 44  Nautical_Twilight      object 
 45  Astronomical_Twilight  object 
dtypes: bool(13), float64(12), int64(1), object(20)
memory usage: 2.0+ GB
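The `info()` output above reports 2.0+ GB of memory usage. A hedged, optional sketch of shrinking that footprint by downcasting dtypes (not applied elsewhere in this notebook; the toy columns are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two of the dataset's 46 columns
df = pd.DataFrame({
    "Severity": np.array([2, 3, 2, 4] * 1000, dtype="int64"),
    "State": ["OH", "OH", "CA", "CA"] * 1000,
})

before = df.memory_usage(deep=True).sum()
df["Severity"] = df["Severity"].astype("int8")   # only 4 levels, fits in int8
df["State"] = df["State"].astype("category")     # low-cardinality strings
after = df.memory_usage(deep=True).sum()
print(after < before)  # True
```

Boolean POI columns are already stored as 1-byte `bool`, so the biggest wins come from the `object` columns.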
In [10]:
# `.nunique()` function counts the number of distinct entries in the column, ignoring any NaN values by default.
num_unique_cities = df_accidents_dataset['City'].nunique()

print(f"The number of unique cities is: {num_unique_cities}")
The number of unique cities is: 13678
In [11]:
# `.nunique()` function counts the number of distinct entries in the column, ignoring any NaN values by default.
num_unique_source = df_accidents_dataset['Source'].nunique()

print(f"The number of unique sources is: {num_unique_source}")
The number of unique sources is: 3
In [12]:
# `.nunique()` function counts the number of distinct entries in the column, ignoring any NaN values by default.
num_unique_severity = df_accidents_dataset['Severity'].nunique()

print(f"The number of unique severity levels is: {num_unique_severity}")
The number of unique severity levels is: 4

2.2. Statistical analysis

In [13]:
df_accidents_dataset.describe()
Out[13]:
Severity Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Speed(mph) Precipitation(in)
count 7.728394e+06 7.728394e+06 7.728394e+06 4.325632e+06 4.325632e+06 7.728394e+06 7.564541e+06 5.729375e+06 7.554250e+06 7.587715e+06 7.551296e+06 7.157161e+06 5.524808e+06
mean 2.212384e+00 3.620119e+01 -9.470255e+01 3.626183e+01 -9.572557e+01 5.618423e-01 6.166329e+01 5.825105e+01 6.483104e+01 2.953899e+01 9.090376e+00 7.685490e+00 8.407210e-03
std 4.875313e-01 5.076079e+00 1.739176e+01 5.272905e+00 1.810793e+01 1.776811e+00 1.901365e+01 2.238983e+01 2.282097e+01 1.006190e+00 2.688316e+00 5.424983e+00 1.102246e-01
min 1.000000e+00 2.455480e+01 -1.246238e+02 2.456601e+01 -1.245457e+02 0.000000e+00 -8.900000e+01 -8.900000e+01 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 2.000000e+00 3.339963e+01 -1.172194e+02 3.346207e+01 -1.177543e+02 0.000000e+00 4.900000e+01 4.300000e+01 4.800000e+01 2.937000e+01 1.000000e+01 4.600000e+00 0.000000e+00
50% 2.000000e+00 3.582397e+01 -8.776662e+01 3.618349e+01 -8.802789e+01 3.000000e-02 6.400000e+01 6.200000e+01 6.700000e+01 2.986000e+01 1.000000e+01 7.000000e+00 0.000000e+00
75% 2.000000e+00 4.008496e+01 -8.035368e+01 4.017892e+01 -8.024709e+01 4.640000e-01 7.600000e+01 7.500000e+01 8.400000e+01 3.003000e+01 1.000000e+01 1.040000e+01 0.000000e+00
max 4.000000e+00 4.900220e+01 -6.711317e+01 4.907500e+01 -6.710924e+01 4.417500e+02 2.070000e+02 2.070000e+02 1.000000e+02 5.863000e+01 1.400000e+02 1.087000e+03 3.647000e+01
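Note the implausible maxima in the table above (e.g. Wind_Speed(mph) of 1,087 and Temperature(F) of 207), which suggest sensor errors. A minimal sketch of one possible treatment, assuming a domain cap of 120 mph (our choice, not from the dataset documentation):

```python
import numpy as np
import pandas as pd

wind = pd.Series([4.6, 7.0, 10.4, 1087.0])  # last reading is physically implausible

# One option: treat readings above the assumed cap as missing,
# so they can be handled together with the other missing values later
cleaned = wind.mask(wind > 120, np.nan)
print(cleaned.max())  # 10.4
```

Clipping to the cap instead of setting NaN would be the alternative; which is better depends on the downstream model.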

2.3. City Analysis

In [14]:
# DataFrame to count accidents per city
city_df = pd.DataFrame(df_accidents_dataset['City'].value_counts().reset_index())

# Calculate the total number of accidents
total_cases = city_df['count'].sum()

# Create a new column for the percentage of each city's cases relative to the total
city_df['Percentage'] = (city_df['count'] / total_cases) * 100

# Adjust the index to start from 1
city_df.index = pd.RangeIndex(start=1, stop=len(city_df) + 1, step=1)

# Display the DataFrame 
display(city_df.head(20))
City count Percentage
1 Miami 186917 2.418654
2 Houston 169609 2.194693
3 Los Angeles 156491 2.024950
4 Charlotte 138652 1.794118
5 Dallas 130939 1.694314
6 Orlando 109733 1.419915
7 Austin 97359 1.259798
8 Raleigh 86079 1.113838
9 Nashville 72930 0.943694
10 Baton Rouge 71588 0.926329
11 Atlanta 68186 0.882308
12 Sacramento 66264 0.857438
13 San Diego 55504 0.718206
14 Phoenix 53974 0.698409
15 Minneapolis 51488 0.666240
16 Richmond 48845 0.632041
17 Oklahoma City 46092 0.596418
18 Jacksonville 42447 0.549252
19 Tucson 39304 0.508583
20 Columbia 38178 0.494013
In [15]:
# Data preparation
cities = city_df['City'].head(10)
counts = city_df['count'].head(10)
percentages = city_df['Percentage'].head(10)

# X locations for the groups
x = np.arange(len(cities))

# Plotting
fig, ax1 = plt.subplots(figsize=(14, 8))

# Bar chart for counts
width = 0.35  # the width of the bars
rects1 = ax1.bar(x - width/2, counts, width, label='Count', color='blue')

# Create twin axis for percentage
ax2 = ax1.twinx()  
rects2 = ax2.bar(x + width/2, percentages, width, label='Percentage', color='green')

# Function to add labels on the bars
def add_labels(bars):
    for bar in bars:
        height = bar.get_height()
        label_position = height + (height * 0.01)  # Adjust position above bar
        ax2.annotate(f'{height:.2f}%',
                     xy=(bar.get_x() + bar.get_width() / 2, label_position),
                     xytext=(0, 3),  # 3 points vertical offset
                     textcoords="offset points",
                     ha='center', va='bottom')

# Adding labels to the percentage bars
add_labels(rects2)

# Add some text for labels, title and custom x-axis tick labels, etc.
ax1.set_xlabel('City')
ax1.set_ylabel('Count', color='blue')
ax2.set_ylabel('Percentage (%)', color='green')
ax1.set_title("City with Maximum no. of Road Accidents\n (2016 - March 2023)")
ax1.set_xticks(x)
ax1.set_xticklabels(cities, rotation=45)
ax1.legend(loc='upper center')
ax2.legend(loc='upper right')

# Display the plot
plt.show()

Key Observations:

  1. Miami is the city with the highest share (2.42%) of road accidents in the US (2016 - March 2023).

  2. Houston is the city with the 2nd highest share (2.19%) of road accidents in the US (2016 - March 2023).

  3. These 10 cities alone, out of 13,678 cities in the US, account for around 16% of the accident records of the past 7 years.
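The roughly-16% figure in observation 3 can be checked directly from the counts in the city table above:

```python
# Counts of the top 10 cities, copied from the table above
top10_counts = [186917, 169609, 156491, 138652, 130939,
                109733, 97359, 86079, 72930, 71588]
total_rows = 7_728_394  # total records in the dataset

share = sum(top10_counts) / total_rows * 100
print(round(share, 1))  # 15.8
```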

In [16]:
# Assuming city_df contains the aggregated data of top 10 cities
city_df = df_accidents_dataset.groupby('City').size().reset_index(name='count')
city_df = city_df.nlargest(10, 'count')

# Merge with the main dataset to get the longitude and latitude values
top_10_cities = df_accidents_dataset[df_accidents_dataset['City'].isin(city_df['City'])]

# Calculate mean latitude and longitude for each city
geo_data = top_10_cities.groupby('City').agg({
    'Start_Lat': 'mean',
    'Start_Lng': 'mean'
}).reset_index()

# Merge the city_df with geo_data to have complete information
merged_df = pd.merge(city_df, geo_data, on='City')

# Set the environment variable using Fiona's context
with fiona.Env(SHAPE_RESTORE_SHX='YES'):
    try:
        usa = gpd.read_file(r'C:\Users\arcad\Labs_term2\States_shapefile.shp')
    except Exception as e:
        print(f"Error reading shapefile: {e}")

# Convert merged DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(merged_df, geometry=gpd.points_from_xy(merged_df['Start_Lng'], merged_df['Start_Lat']))

# Plotting
fig, ax = plt.subplots(figsize=(15, 10))
usa.boundary.plot(ax=ax)  # Plot USA boundaries
gdf.plot(ax=ax, color='red', markersize=50)  # Plot cities on the map

# Adding city names
for x, y, label in zip(gdf.geometry.x, gdf.geometry.y, gdf['City']):
    ax.annotate(label, xy=(x, y), xytext=(3, 3), textcoords="offset points")

plt.title('Top 10 Cities with the Most Accidents')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True)
plt.show()

2.4. Severity Analysis

In [17]:
df_source = df_accidents_dataset.groupby(['Severity','Source']).size().reset_index().pivot(\
    columns='Severity', index='Source', values=0)
df_source.plot(kind='bar', stacked=True, title='Severity Count by Sources')

    
Out[17]:
<Axes: title={'center': 'Severity Count by Sources'}, xlabel='Source'>
In [18]:
# Assuming df_accidents_dataset is the DataFrame containing the accident data

# Calculate the count of each severity class
severity_counts = df_accidents_dataset['Severity'].value_counts()

# Calculate the total number of entries
total_counts = len(df_accidents_dataset)

# Calculate the percentage for each severity class
severity_percentage = (severity_counts / total_counts) * 100

# Display the result
severity_percentage = severity_percentage.sort_index()
severity_percentage_df = pd.DataFrame({
    'Severity': severity_percentage.index,
    'Percentage': severity_percentage.values
})

print(severity_percentage_df)
   Severity  Percentage
0         1    0.871669
1         2   79.667017
2         3   16.812510
3         4    2.648804
In [71]:
import matplotlib.pyplot as plt

# Assuming severity_percentage_df is the DataFrame containing severity percentages

# Plotting the percentage of each severity class
fig, ax = plt.subplots(figsize=(10, 6))

ax.bar(severity_percentage_df['Severity'], severity_percentage_df['Percentage'], color=['blue', 'orange', 'green', 'red'])

# Adding title and labels
plt.title('Percentage of Accidents by Severity')
plt.xlabel('Severity')
plt.ylabel('Percentage (%)')
plt.xticks(severity_percentage_df['Severity'], ['Severity 1', 'Severity 2', 'Severity 3', 'Severity 4'])

# Display the percentage on top of each bar
for index, value in enumerate(severity_percentage_df['Percentage']):
    ax.text(severity_percentage_df['Severity'][index], value + 0.5, f'{value:.2f}%', ha='center')

# Show the plot
plt.show()

Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).
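Because Severity 2 accounts for roughly 80% of records, the classes are heavily imbalanced, which will matter for model training in later sprints. A small sketch of the "balanced" class-weight heuristic (the same formula scikit-learn's `compute_class_weight` uses), on a toy label vector mirroring these proportions:

```python
import numpy as np

# Toy label vector mirroring the imbalance above (Severity 2 ~80%)
y = np.array([1] * 1 + [2] * 80 + [3] * 17 + [4] * 2)

classes, counts = np.unique(y, return_counts=True)
# "balanced" heuristic: n_samples / (n_classes * count_per_class)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes.tolist(), np.round(weights, 2).tolist())))
# {1: 25.0, 2: 0.31, 3: 1.47, 4: 12.5}
```

Rare classes (Severity 1 and 4) get large weights, so a weighted loss does not simply learn to predict Severity 2 everywhere.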

In [80]:
# Filter the necessary columns
df_severity = df_accidents_dataset[['City', 'Start_Lat', 'Start_Lng', 'Severity']]

# Convert DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(df_severity, geometry=gpd.points_from_xy(df_severity['Start_Lng'], df_severity['Start_Lat']))

# Set the environment variable using Fiona's context
with fiona.Env(SHAPE_RESTORE_SHX='YES'):
    try:
        usa = gpd.read_file(r"C:\Users\arcad\Labs_term2\united-statesgeojson.shp")
    except Exception as e:
        print(f"Error reading shapefile: {e}")

# Plotting
fig, ax = plt.subplots(figsize=(15, 10))
usa.boundary.plot(ax=ax)  # Plot USA boundaries

# Plot each severity level with different colors
severity_colors = {1: 'blue', 2: 'orange', 3: 'green', 4: 'red'}


for severity in gdf['Severity'].unique():
    gdf[gdf['Severity'] == severity].plot(ax=ax, color=severity_colors[severity], markersize=5, label=f'Severity {severity}')

# Adding title and legend
plt.title('Accidents by Severity in the US')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True)
plt.legend()
plt.show()
In [82]:
# Assuming df_accidents_dataset is the DataFrame containing the accident data

# Filter the necessary columns
df_severity = df_accidents_dataset[['City', 'Start_Lat', 'Start_Lng', 'Severity']]

# Convert DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(df_severity, geometry=gpd.points_from_xy(df_severity['Start_Lng'], df_severity['Start_Lat']))

# Set the environment variable using Fiona's context
with fiona.Env(SHAPE_RESTORE_SHX='YES'):
    try:
        usa = gpd.read_file(r"C:\Users\arcad\Labs_term2\united-statesgeojson.shp")
    except Exception as e:
        print(f"Error reading shapefile: {e}")

# Plotting each severity level separately
severity_colors = {1: 'blue', 2: 'orange', 3: 'green', 4: 'red'}

for severity in gdf['Severity'].unique():
    fig, ax = plt.subplots(figsize=(15, 10))
    usa.boundary.plot(ax=ax)  # Plot USA boundaries
    gdf[gdf['Severity'] == severity].plot(ax=ax, color=severity_colors[severity], markersize=5, label=f'Severity {severity}')
    
    # Adding title and legend
    plt.title(f'Accidents with Severity {severity} in the US')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.grid(True)
    plt.legend()
    plt.show()
In [84]:
import geopandas as gpd
import fiona
import matplotlib.pyplot as plt
import pandas as pd



# Filter the necessary columns
df_severity = df_accidents_dataset[['City', 'Start_Lat', 'Start_Lng', 'Severity']]

# Get the top 500 cities with the highest number of accidents
top_500_cities = df_severity['City'].value_counts().nlargest(500).index
df_top_500 = df_severity[df_severity['City'].isin(top_500_cities)]

# Convert DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(df_top_500, geometry=gpd.points_from_xy(df_top_500['Start_Lng'], df_top_500['Start_Lat']))

# Set the environment variable using Fiona's context
with fiona.Env(SHAPE_RESTORE_SHX='YES'):
    try:
        usa = gpd.read_file(r"C:\Users\arcad\Labs_term2\united-statesgeojson.shp")
    except Exception as e:
        print(f"Error reading shapefile: {e}")

# Plotting each severity level separately
severity_colors = {1: 'blue', 2: 'orange', 3: 'green', 4: 'red'}

for severity in gdf['Severity'].unique():
    fig, ax = plt.subplots(figsize=(15, 10))
    usa.boundary.plot(ax=ax)  # Plot USA boundaries
    gdf[gdf['Severity'] == severity].plot(ax=ax, color=severity_colors[severity], markersize=5, label=f'Severity {severity}')
    
    # Adding title and legend
    plt.title(f'Accidents with Severity {severity} in the Top 500 Cities with the Most Accidents in the US')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.grid(True)
    plt.legend()
    plt.show()

3. Data Preprocessing

3.1. Drop some unneeded columns in the dataset

'ID' doesn't provide any useful information about accidents. 'Distance(mi)' can be measured only after the accident has already happened, so it cannot be used to predict severity ahead of time. For 'Description', the POI features have already been extracted from it by the dataset creators. 'Street', 'County', 'State', 'Zipcode', 'Country': we focus only on the City where the accident happened. 'Timezone', 'Airport_Code', 'Weather_Timestamp': not useful information.

In [23]:
# Keep only records reported by Source1
df_accidents_dataset = df_accidents_dataset[df_accidents_dataset['Source'] == 'Source1']
In [24]:
# Drop some useless features

df_accidents_dataset = df_accidents_dataset.drop(['ID','Source','Description','Distance(mi)', 'End_Time', 
              'End_Lat', 'End_Lng'], axis=1)

3.2. Drop duplicates in the dataset

In [25]:
print("Number of rows:", len(df_accidents_dataset.index))
df_accidents_dataset.drop_duplicates(inplace=True)
print("Number of rows after drop of duplicates:", len(df_accidents_dataset.index))
Number of rows: 4325632
Number of rows after drop of duplicates: 3960160

3.3. Handle duplicate values in columns

In [26]:
print("Wind Direction: ", df_accidents_dataset['Wind_Direction'].unique())
Wind Direction:  ['SW' 'Calm' 'WSW' 'WNW' 'West' 'NNW' 'South' 'W' 'NW' 'North' 'SSE' 'SSW'
 'ESE' 'SE' nan 'East' 'Variable' 'NNE' 'NE' 'ENE' 'CALM' 'S' 'VAR' 'N'
 'E']
In [27]:
df_accidents_dataset.loc[df_accidents_dataset['Wind_Direction']=='Calm','Wind_Direction'] = 'CALM'
df_accidents_dataset.loc[(df_accidents_dataset['Wind_Direction']=='West')|(df_accidents_dataset['Wind_Direction']=='WSW')|(df_accidents_dataset['Wind_Direction']=='WNW'),'Wind_Direction'] = 'W'
df_accidents_dataset.loc[(df_accidents_dataset['Wind_Direction']=='South')|(df_accidents_dataset['Wind_Direction']=='SSW')|(df_accidents_dataset['Wind_Direction']=='SSE'),'Wind_Direction'] = 'S'
df_accidents_dataset.loc[(df_accidents_dataset['Wind_Direction']=='North')|(df_accidents_dataset['Wind_Direction']=='NNW')|(df_accidents_dataset['Wind_Direction']=='NNE'),'Wind_Direction'] = 'N'
df_accidents_dataset.loc[(df_accidents_dataset['Wind_Direction']=='East')|(df_accidents_dataset['Wind_Direction']=='ESE')|(df_accidents_dataset['Wind_Direction']=='ENE'),'Wind_Direction'] = 'E'
df_accidents_dataset.loc[df_accidents_dataset['Wind_Direction']=='Variable','Wind_Direction'] = 'VAR'
print("Wind Direction after duplicate removal: ", df_accidents_dataset['Wind_Direction'].unique())
Wind Direction after duplicate removal:  ['SW' 'CALM' 'W' 'N' 'S' 'NW' 'E' 'SE' nan 'VAR' 'NE']
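The chained `.loc` assignments above can also be expressed as a single dictionary mapping, which is easier to extend; a sketch on a toy Series:

```python
import pandas as pd

wind = pd.Series(["Calm", "WSW", "South", "NNE", "Variable", "SW"])

# One mapping equivalent to the chained .loc assignments above
canonical = {
    "Calm": "CALM", "Variable": "VAR",
    "West": "W", "WSW": "W", "WNW": "W",
    "South": "S", "SSW": "S", "SSE": "S",
    "North": "N", "NNW": "N", "NNE": "N",
    "East": "E", "ESE": "E", "ENE": "E",
}
wind = wind.replace(canonical)
print(wind.tolist())  # ['CALM', 'W', 'S', 'N', 'VAR', 'SW']
```

Values not in the mapping (e.g. 'SW', NaN) pass through unchanged, matching the behavior of the `.loc` version.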
In [28]:
print(df_accidents_dataset["Weather_Condition"].unique())
['Light Rain' 'Overcast' 'Mostly Cloudy' 'Snow' 'Light Snow' 'Cloudy' nan
 'Scattered Clouds' 'Clear' 'Partly Cloudy' 'Light Freezing Drizzle'
 'Light Drizzle' 'Haze' 'Rain' 'Heavy Rain' 'Fair' 'Drizzle' 'Fog'
 'Thunderstorms and Rain' 'Patches of Fog' 'Light Thunderstorms and Rain'
 'Mist' 'Rain Showers' 'Light Rain Showers' 'Heavy Drizzle' 'Smoke'
 'Light Freezing Fog' 'Light Freezing Rain' 'Blowing Snow'
 'Heavy Thunderstorms and Rain' 'Heavy Snow' 'Snow Grains' 'Squalls'
 'Light Fog' 'Shallow Fog' 'Thunderstorm' 'Light Ice Pellets' 'Thunder'
 'Thunder in the Vicinity' 'Fair / Windy' 'Light Rain with Thunder'
 'Heavy Thunderstorms and Snow' 'Light Snow Showers' 'Cloudy / Windy'
 'Ice Pellets' 'N/A Precipitation' 'Light Thunderstorms and Snow'
 'T-Storm' 'Rain / Windy' 'Wintry Mix' 'Partly Cloudy / Windy'
 'Heavy T-Storm' 'Sand' 'Light Rain / Windy' 'Widespread Dust'
 'Mostly Cloudy / Windy' 'Blowing Dust / Windy' 'Blowing Dust'
 'Volcanic Ash' 'Freezing Rain / Windy' 'Small Hail' 'Wintry Mix / Windy'
 'Light Snow / Windy' 'Heavy Ice Pellets' 'Heavy Snow / Windy'
 'Heavy Rain / Windy' 'Heavy T-Storm / Windy' 'Fog / Windy' 'Dust Whirls'
 'Showers in the Vicinity' 'Funnel Cloud' 'Thunder / Wintry Mix'
 'Light Snow and Sleet' 'Thunder / Wintry Mix / Windy' 'Snow and Sleet'
 'Haze / Windy' 'Freezing Rain' 'Mist / Windy' 'T-Storm / Windy'
 'Snow / Windy' 'Light Drizzle / Windy' 'Drizzle and Fog'
 'Light Rain Shower' 'Light Freezing Rain / Windy'
 'Snow and Sleet / Windy' 'Rain Shower / Windy' 'Drizzle / Windy' 'Hail'
 'Light Snow with Thunder' 'Rain Shower' 'Light Snow Shower'
 'Heavy Snow with Thunder' 'Blowing Snow / Windy' 'Thunder / Windy'
 'Sleet and Thunder' 'Light Sleet' 'Smoke / Windy' 'Sleet'
 'Heavy Freezing Drizzle' 'Widespread Dust / Windy'
 'Heavy Sleet and Thunder' 'Drifting Snow / Windy' 'Freezing Drizzle'
 'Snow and Thunder' 'Light Sleet / Windy' 'Sand / Dust Whirls Nearby'
 'Thunder and Hail' 'Sleet / Windy' 'Heavy Sleet'
 'Light Snow and Sleet / Windy' 'Shallow Fog / Windy' 'Squalls / Windy'
 'Light Rain Shower / Windy' 'Tornado' 'Thunder and Hail / Windy'
 'Light Snow Shower / Windy' 'Sand / Dust Whirlwinds'
 'Heavy Sleet / Windy' 'Sand / Windy' 'Duststorm'
 'Heavy Rain Shower / Windy' 'Snow and Thunder / Windy'
 'Sand / Dust Whirlwinds / Windy' 'Blowing Snow Nearby' 'Partial Fog'
 'Heavy Freezing Rain' 'Blowing Sand' 'Patches of Fog / Windy'
 'Heavy Rain Shower' 'Drifting Snow' 'Heavy Blowing Snow'
 'Low Drifting Snow' 'Light Blowing Snow' 'Heavy Rain Showers'
 'Light Haze' 'Heavy Thunderstorms with Small Hail']
In [29]:
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Thunder|T-Storm", na=False), "Weather_Condition"] = "Thunderstorm"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Snow|Sleet|Wintry", na=False), "Weather_Condition"] = "Snow"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Rain|Drizzle|Shower", na=False), "Weather_Condition"] = "Rain"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Wind|Squalls", na=False), "Weather_Condition"] = "Windy"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Hail|Pellets", na=False), "Weather_Condition"] = "Hail"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Fair", na=False), "Weather_Condition"] = "Clear"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Cloud|Overcast", na=False), "Weather_Condition"] = "Cloudy"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Mist|Haze|Fog", na=False), "Weather_Condition"] = "Fog"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Sand|Dust", na=False), "Weather_Condition"] = "Sand"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("Smoke|Volcanic Ash", na=False), "Weather_Condition"] = "Smoke"
df_accidents_dataset.loc[df_accidents_dataset["Weather_Condition"].str.contains("N/A Precipitation", na=False), "Weather_Condition"] = np.nan

print(df_accidents_dataset["Weather_Condition"].unique())
['Rain' 'Cloudy' 'Snow' nan 'Clear' 'Fog' 'Thunderstorm' 'Smoke' 'Windy'
 'Hail' 'Sand' 'Tornado']
In [30]:
df_accidents_dataset = df_accidents_dataset.drop(['Country','Turning_Loop','Weather_Timestamp'], axis=1)

3.4. Check for missing values in the dataset

In [31]:
# Display the total number of missing values per column
missing_values = df_accidents_dataset.isnull().sum()
print(missing_values)
Severity                      0
Start_Time                    0
Start_Lat                     0
Start_Lng                     0
Street                     8285
City                        178
County                        0
State                         0
Zipcode                    1401
Timezone                   4707
Airport_Code              15051
Temperature(F)            97138
Wind_Chill(F)            519487
Humidity(%)              102897
Pressure(in)              83628
Visibility(mi)           102143
Wind_Direction           109720
Wind_Speed(mph)          195827
Precipitation(in)        602707
Weather_Condition        101902
Amenity                       0
Bump                          0
Crossing                      0
Give_Way                      0
Junction                      0
No_Exit                       0
Railway                       0
Roundabout                    0
Station                       0
Stop                          0
Traffic_Calming               0
Traffic_Signal                0
Sunrise_Sunset            19781
Civil_Twilight            19781
Nautical_Twilight         19781
Astronomical_Twilight     19781
dtype: int64
In [32]:
missing = pd.DataFrame(df_accidents_dataset.isnull().sum()).reset_index()
missing.columns = ['Feature', 'Missing_Percent(%)']
missing['Missing_Percent(%)'] = missing['Missing_Percent(%)'].apply(lambda x: x / df_accidents_dataset.shape[0] * 100)
missing.loc[missing['Missing_Percent(%)']>0,:]
Out[32]:
Feature Missing_Percent(%)
4 Street 0.209209
5 City 0.004495
8 Zipcode 0.035377
9 Timezone 0.118859
10 Airport_Code 0.380060
11 Temperature(F) 2.452881
12 Wind_Chill(F) 13.117829
13 Humidity(%) 2.598304
14 Pressure(in) 2.111733
15 Visibility(mi) 2.579264
16 Wind_Direction 2.770595
17 Wind_Speed(mph) 4.944926
18 Precipitation(in) 15.219259
19 Weather_Condition 2.573179
32 Sunrise_Sunset 0.499500
33 Civil_Twilight 0.499500
34 Nautical_Twilight 0.499500
35 Astronomical_Twilight 0.499500

3.4.1. Columns with a high number of missing values: End_Lat and End_Lng have approximately 44% missing values, so we will drop them. Wind_Chill(F) also has a large share of missing values (about 13%), so we will drop it as well.

Precipitation(in): we will keep this column, since precipitation could be a useful predictor of accident severity.

3.4.2. Columns with moderate missing values: Description, Street, City, Zipcode, Airport_Code, and Weather_Timestamp are not important predictors, so we can remove them. For Temperature(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Direction, Wind_Speed(mph), and Weather_Condition, we will impute missing values with appropriate statistical measures (mean, median, mode) or with more advanced techniques such as K-Nearest Neighbors (KNN) imputation.

3.4.3. Columns with few missing values: Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, and Astronomical_Twilight have relatively few missing values. We will drop the rows with missing values in these columns.
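As a hedged illustration of the KNN-based imputation mentioned above, here is a minimal sketch using scikit-learn's `KNNImputer` on a toy weather matrix (the values are made up, not taken from the dataset):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy weather matrix with gaps (rows = accidents; cols = temperature, humidity)
X = np.array([[42.0, 58.0],
              [np.nan, 91.0],
              [36.0, np.nan],
              [39.0, 55.0]])

# Each missing entry is replaced by the mean of that feature among the
# 2 nearest rows, with distances computed on the observed features only
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # False: no missing values remain
```

Unlike filling with a global mean, KNN imputation uses rows with similar observed weather to fill each gap, at the cost of much more computation on 3.8M rows.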

3.5. Handling missing values in the dataset¶

In [33]:
# Drop columns with high missing values
df_accidents_dataset = df_accidents_dataset.drop(columns='Wind_Chill(F)')

# Impute missing values for numerical columns with mean
numerical_cols = ['Temperature(F)', 'Humidity(%)', 'Precipitation(in)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)']
for col in numerical_cols:
    df_accidents_dataset[col] = df_accidents_dataset[col].fillna(df_accidents_dataset[col].mean())


# Remove rows with missing values in critical columns
df_accidents_dataset = df_accidents_dataset.dropna(subset=['Street','Weather_Condition', 'Start_Time',
                    'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight','Wind_Direction'])


# Verify missing values are handled
print(df_accidents_dataset.isnull().sum())
Severity                 0
Start_Time               0
Start_Lat                0
Start_Lng                0
Street                   0
City                     0
County                   0
State                    0
Zipcode                  0
Timezone                 0
Airport_Code             0
Temperature(F)           0
Humidity(%)              0
Pressure(in)             0
Visibility(mi)           0
Wind_Direction           0
Wind_Speed(mph)          0
Precipitation(in)        0
Weather_Condition        0
Amenity                  0
Bump                     0
Crossing                 0
Give_Way                 0
Junction                 0
No_Exit                  0
Railway                  0
Roundabout               0
Station                  0
Stop                     0
Traffic_Calming          0
Traffic_Signal           0
Sunrise_Sunset           0
Civil_Twilight           0
Nautical_Twilight        0
Astronomical_Twilight    0
dtype: int64
In [34]:
df_accidents_dataset= df_accidents_dataset.drop(['Amenity','Bump','Give_Way','No_Exit','Roundabout','Traffic_Calming'], axis=1)
In [35]:
df_accidents_dataset = df_accidents_dataset.drop(columns=['Street','County','State','Zipcode','Airport_Code'])

3.6. Add features useful for prediction¶

In [36]:
# Convert 'Start_Time' to datetime
df_accidents_dataset['Start_Time'] = pd.to_datetime(df_accidents_dataset['Start_Time'], format='mixed')

# Create new features
df_accidents_dataset['Start_Year'] = df_accidents_dataset['Start_Time'].dt.year
df_accidents_dataset['Start_Month'] = df_accidents_dataset['Start_Time'].dt.month
df_accidents_dataset['Start_Day'] = df_accidents_dataset['Start_Time'].dt.day
df_accidents_dataset['Start_Hour'] = df_accidents_dataset['Start_Time'].dt.hour
df_accidents_dataset['Start_Minute'] = df_accidents_dataset['Start_Time'].dt.minute

# Drop the original column
df_accidents_dataset = df_accidents_dataset.drop(columns=['Start_Time'])
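The `.dt` extraction above can be sanity-checked on a toy timestamp; the sample value below is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"Start_Time": ["2016-02-08 06:15:00"]})
toy["Start_Time"] = pd.to_datetime(toy["Start_Time"])

# Split the timestamp into the same kind of numeric features used above
toy["Start_Year"] = toy["Start_Time"].dt.year
toy["Start_Hour"] = toy["Start_Time"].dt.hour
print(toy[["Start_Year", "Start_Hour"]].iloc[0].tolist())  # [2016, 6]
```

Breaking the timestamp into year, month, day, hour, and minute lets a tree-based model pick up seasonal and rush-hour patterns that a raw datetime column cannot express.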

3.7. Encode categorical variables into numerical format using techniques like one-hot encoding or label encoding.¶

Encode the boolean values in a numerical form.

In [37]:
df_accidents_dataset = df_accidents_dataset.replace([True, False], [1, 0])

df_accidents_dataset.head()
Out[37]:
Severity Start_Lat Start_Lng City Timezone Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction ... Traffic_Signal Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight Start_Year Start_Month Start_Day Start_Hour Start_Minute
3402762 3 40.10891 -83.09286 Dublin US/Eastern 42.1 58.0 29.76 10.0 SW ... 0 Night Night Night Night 2016 2 8 0 37
3402763 2 39.86542 -84.06280 Dayton US/Eastern 36.9 91.0 29.68 10.0 CALM ... 0 Night Night Night Night 2016 2 8 5 56
3402764 2 39.10266 -84.52468 Cincinnati US/Eastern 36.0 97.0 29.70 10.0 CALM ... 0 Night Night Night Day 2016 2 8 6 15
3402765 2 39.10148 -84.52341 Cincinnati US/Eastern 36.0 97.0 29.70 10.0 CALM ... 0 Night Night Night Day 2016 2 8 6 15
3402766 2 41.06213 -81.53784 Akron US/Eastern 39.0 55.0 29.65 10.0 CALM ... 0 Night Night Day Day 2016 2 8 6 51

5 rows × 28 columns

In [38]:
{column: len(df_accidents_dataset[column].unique()) for column in df_accidents_dataset.columns if df_accidents_dataset.dtypes[column] == 'object'}
Out[38]:
{'City': 11217,
 'Timezone': 4,
 'Wind_Direction': 10,
 'Weather_Condition': 11,
 'Sunrise_Sunset': 2,
 'Civil_Twilight': 2,
 'Nautical_Twilight': 2,
 'Astronomical_Twilight': 2}
In [39]:
cat_columns = ['Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight']
In [40]:
for column in cat_columns:
    df_accidents_dataset[column] = df_accidents_dataset[column].apply(lambda x: 1 if x == 'Day' else 0)
In [41]:
from scipy.sparse import hstack, csr_matrix
categorical_columns_one_hot = ['Wind_Direction','Weather_Condition','Timezone']

def one_hot_encode_sparse(df, columns):
    df = df.copy()
    sparse_matrices = []
    for column in columns:
        dummies = pd.get_dummies(df[column], sparse=True)
        sparse_matrices.append(csr_matrix(dummies))
        df = df.drop(column, axis=1)
    sparse_combined = hstack(sparse_matrices)
    df_sparse_encoded = pd.DataFrame.sparse.from_spmatrix(sparse_combined)
    df.reset_index(drop=True, inplace=True)
    df_sparse_encoded.reset_index(drop=True, inplace=True)
    df_combined = pd.concat([df, df_sparse_encoded], axis=1)
    return df_combined

df_accidents_sparse_encoded = one_hot_encode_sparse(df_accidents_dataset, categorical_columns_one_hot)
In [42]:
pip install category_encoders
Requirement already satisfied: category_encoders in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (2.6.3)
Requirement already satisfied: numpy>=1.14.0 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from category_encoders) (1.26.3)
Requirement already satisfied: scikit-learn>=0.20.0 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from category_encoders) (1.4.2)
Requirement already satisfied: scipy>=1.0.0 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from category_encoders) (1.11.4)
Requirement already satisfied: statsmodels>=0.9.0 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from category_encoders) (0.14.1)
Requirement already satisfied: pandas>=1.0.5 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from category_encoders) (2.0.1)
Requirement already satisfied: patsy>=0.5.1 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from category_encoders) (0.5.6)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\arcad\appdata\roaming\python\python311\site-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.0.5->category_encoders) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.0.5->category_encoders) (2023.4)
Requirement already satisfied: six in c:\users\arcad\appdata\roaming\python\python311\site-packages (from patsy>=0.5.1->category_encoders) (1.16.0)
Requirement already satisfied: joblib>=1.2.0 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=0.20.0->category_encoders) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\arcad\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=0.20.0->category_encoders) (3.2.0)
Requirement already satisfied: packaging>=21.3 in c:\users\arcad\appdata\roaming\python\python311\site-packages (from statsmodels>=0.9.0->category_encoders) (23.0)
Note: you may need to restart the kernel to use updated packages.

To encode the City feature while reducing memory usage and the number of added columns, we use the BinaryEncoder from the category_encoders library.

In [43]:
import category_encoders as ce
binary_encoder = ce.binary.BinaryEncoder()

city_binary_enc = binary_encoder.fit_transform(df_accidents_sparse_encoded["City"])
df_accidents_sparse_encoded = pd.concat([df_accidents_sparse_encoded, city_binary_enc], axis=1).drop("City", axis=1)
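To make the memory argument concrete, here is a rough sketch of the idea behind binary encoding (a hypothetical mini version, not category_encoders' exact output, which offsets codes differently): each category gets an ordinal code, and the code's binary digits are spread over only log2(n) columns instead of n one-hot columns.

```python
import numpy as np
import pandas as pd

# Toy city column; real dataset has 11,217 distinct cities
cities = pd.Series(["Dublin", "Dayton", "Cincinnati", "Dayton"])
codes = cities.astype("category").cat.codes.to_numpy()  # alphabetical codes

# Number of bits needed to represent every code (2 bits cover 3 cities here;
# 11,217 cities would need only 14 binary columns vs 11,217 one-hot columns)
n_bits = max(1, int(np.ceil(np.log2(codes.max() + 1))))
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
print(bits.tolist())  # Dublin=2 -> [1, 0], Dayton=1 -> [0, 1], Cincinnati=0 -> [0, 0]
```

The trade-off is that the binary columns are not individually interpretable, but tree ensembles can still split on them effectively.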

3.8. Correlation matrix¶

In [44]:
# Calculate correlation matrix
plt.figure(figsize=(25,25))
cmap = sns.diverging_palette(220, 20, sep=20, as_cmap=True)
sns.heatmap(df_accidents_sparse_encoded.corr(), annot=True,cmap=cmap, center=0).set_title("Correlation Heatmap", fontsize=14)
plt.show()
In [81]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

numeric_cols = ['Severity','Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)', 'Start_Lng', 'Start_Lat', 'Start_Year', 'Start_Month', 'Start_Day', 'Start_Hour', 'Start_Minute']

# Filter the DataFrame to include only the specified numeric columns
df_numeric = df_accidents_sparse_encoded[numeric_cols]

# Calculate the correlation matrix
correlation_matrix = df_numeric.corr()

# Create the heatmap
plt.figure(figsize=(15, 15))
cmap = sns.diverging_palette(220, 20, sep=20, as_cmap=True)
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, center=0)
plt.title("Correlation Heatmap", fontsize=14)
plt.show()
In [89]:
more_cols = ['Severity','Crossing', 'Junction','Railway','Station','Stop', 'Traffic_Signal','Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight']

# Filter the DataFrame to include only the specified numeric columns
df_numeric_2 = df_accidents_sparse_encoded[more_cols]

# Calculate the correlation matrix
correlation_matrix_2 = df_numeric_2.corr()

# Create the heatmap
plt.figure(figsize=(15, 15))
cmap = sns.diverging_palette(220, 20, sep=20, as_cmap=True)
sns.heatmap(correlation_matrix_2, annot=True, cmap=cmap, center=0)
plt.title("Correlation Heatmap", fontsize=14)
plt.show()

3.9. Scale and normalize the features¶

In [45]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
In [85]:
# Select columns and apply scaling
numeric_cols = ['Temperature(F)','Humidity(%)','Pressure(in)','Visibility(mi)','Wind_Speed(mph)','Precipitation(in)','Start_Lng','Start_Lat','Start_Year', 'Start_Month','Start_Day','Start_Hour','Start_Minute']
In [47]:
df_accidents_sparse_encoded[numeric_cols] = scaler.fit_transform(df_accidents_sparse_encoded[numeric_cols])
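What `MinMaxScaler` does can be verified by hand on a toy temperature column (values here are illustrative): each value is mapped to [0, 1] via (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy temperature column
temps = np.array([[36.0], [39.0], [42.1]])

scaled = MinMaxScaler().fit_transform(temps)
manual = (temps - temps.min()) / (temps.max() - temps.min())
print(np.allclose(scaled, manual))  # True
```

Scaling is not strictly required for random forests, but it keeps the numeric columns on a common range alongside the 0/1 encoded columns and makes the pipeline reusable with scale-sensitive models later.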
In [50]:
import joblib
# After preprocessing and before fitting the model
feature_names = list(df_accidents_sparse_encoded.columns)

# Save feature names to a file
with open(r"C:\Users\arcad\Labs_term2\feature_names.pkl", 'wb') as f:
    joblib.dump(feature_names, f)
In [51]:
df_accidents_dataset.Severity.value_counts()
Out[51]:
Severity
2    3455874
4     174038
3     151456
1      24896
Name: count, dtype: int64
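The counts above show a heavy imbalance (Severity 2 dominates). One common heuristic for choosing class weights is scikit-learn's `class_weight="balanced"` rule, weight = n_samples / (n_classes * count); a sketch using the counts from the output above:

```python
# Class counts taken from the value_counts() output above
counts = {1: 24896, 2: 3455874, 3: 151456, 4: 174038}
n_samples = sum(counts.values())
n_classes = len(counts)

# scikit-learn's class_weight="balanced" heuristic:
# weight_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print({c: round(w, 2) for c, w in sorted(weights.items())})
```

Under this rule the rare Severity 1 class gets a weight near 38 while Severity 2 gets well under 1; the manual `{..., 4: 10}` weighting used in section 6.2 below is a simpler, hand-picked alternative.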

4. Splitting the Dataset¶

In [52]:
X = df_accidents_sparse_encoded.drop('Severity', axis=1)  # Features
y = df_accidents_sparse_encoded['Severity']  # Target variable
In [53]:
X.columns = X.columns.astype(str)
In [54]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [56]:
X_train.shape
Out[56]:
(3045011, 62)
In [57]:
X_test.shape
Out[57]:
(761253, 62)

5. Model Training¶

5.1. Training¶

In [58]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
In [59]:
model.fit(X_train, y_train)
c:\Users\arcad\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\validation.py:869: UserWarning: pandas.DataFrame with sparse columns found.It will be converted to a dense numpy array.
  warnings.warn(
Out[59]:
RandomForestClassifier(random_state=42)

5.2. Saving the Model¶

In [60]:
import joblib
In [61]:
joblib.dump(model, 'random_forest_model.joblib')
Out[61]:
['random_forest_model.joblib']

6. Model Evaluation¶

6.1. Evaluation¶

In [62]:
y_pred = model.predict(X_test)
c:\Users\arcad\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\validation.py:869: UserWarning: pandas.DataFrame with sparse columns found.It will be converted to a dense numpy array.
  warnings.warn(
In [63]:
target_names = ['Severity 1', 'Severity 2', 'Severity 3', 'Severity 4']
report1 = classification_report(y_test, y_pred, target_names=target_names, output_dict = True)
In [64]:
# Convert classification report to DataFrame
df_report = pd.DataFrame(report1).transpose()
In [93]:
df_report = df_report.loc[['Severity 1', 'Severity 2', 'Severity 3', 'Severity 4', 'accuracy'], 
                          ['precision', 'recall', 'f1-score', 'support']]

# Plotting the classification report as a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df_report.iloc[:, :-1], annot=True, cmap='Blues', linewidths=0.5)
plt.title('Classification Report Heatmap')
plt.xlabel('Metrics')
plt.ylabel('Classes')
plt.show()

6.1.1. Analysis of results shown in the heatmap¶

Accuracy: 0.92 - High overall accuracy, although with this strong class imbalance accuracy alone can be misleading.

Precision: Precision measures the ratio of true positive observations to the total predicted positives.

  • Severity 2: 0.94 - Very high precision, indicating excellent prediction accuracy for severity 2.
  • Severity 4: 0.47 - Low precision, indicating many false positives.

Recall: Recall measures the ratio of true positive observations to the total actual positives.

  • Severity 2: 0.98 - Very high recall, meaning most severity 2 cases are correctly identified.
  • Severity 4: 0.29 - Very low recall, indicating many severity 4 cases are missed.

F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall.

  • Severity 1: 0.63 - Moderate balance between precision and recall.
  • Severity 2: 0.96 - Very high F1-score, showing an excellent balance for severity 2.
  • Severity 4: 0.36 - Low F1-score, reflecting poor performance.

Final observations

  • Severity 2 has the best performance across all metrics, indicating the model is very effective at predicting this class.

  • Severity 4 has the poorest performance, with low precision, recall, and F1-score.
  • Severity 1 and Severity 3 have moderate performance.
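The three metrics discussed above can be recomputed directly from raw counts. The toy counts below are illustrative, chosen to roughly reproduce the Severity 4 figures, not taken from the model's confusion matrix:

```python
# Toy confusion counts for one class (illustrative, not from the model above)
tp, fp, fn = 29, 33, 71  # true positives, false positives, false negatives

precision = tp / (tp + fp)           # of predicted positives, how many are right
recall = tp / (tp + fn)              # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.47 0.29 0.36
```

Because F1 is a harmonic mean, it is pulled toward the weaker of precision and recall, which is why Severity 4's low recall drags its F1 down.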

6.2. Handling Class Imbalance by training a Model by assigning Higher weights to minority class¶

In [66]:
model2 = RandomForestClassifier(class_weight={1: 1, 2: 1, 3: 1, 4: 10}, random_state=42)
model2.fit(X_train, y_train)
c:\Users\arcad\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\validation.py:869: UserWarning: pandas.DataFrame with sparse columns found.It will be converted to a dense numpy array.
  warnings.warn(
Out[66]:
RandomForestClassifier(class_weight={1: 1, 2: 1, 3: 1, 4: 10}, random_state=42)
In [67]:
joblib.dump(model2, 'rf_model_classweights.joblib')
Out[67]:
['rf_model_classweights.joblib']
In [68]:
y_pred_cw = model2.predict(X_test)
c:\Users\arcad\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\validation.py:869: UserWarning: pandas.DataFrame with sparse columns found.It will be converted to a dense numpy array.
  warnings.warn(
In [69]:
report = classification_report(y_test, y_pred_cw, target_names=target_names,output_dict = True)
In [96]:
# Convert classification report to DataFrame
report_df = pd.DataFrame(report).transpose()

report_df = report_df.loc[['Severity 1', 'Severity 2', 'Severity 3', 'Severity 4', 'accuracy'], 
                          ['precision', 'recall', 'f1-score', 'support']]

# Plotting the classification report as a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(report_df.iloc[:, :-1], annot=True, cmap='Greens', linewidths=0.5)
plt.title('Classification Report Heatmap \n- with Higher weights to minority class')
plt.xlabel('Metrics')
plt.ylabel('Classes')
plt.show()

6.2.1. Analysis of results shown in the heatmap¶

Accuracy: 0.92 - Accuracy remains high and essentially unchanged after reweighting.

Precision: Precision measures the ratio of true positive observations to the total predicted positives. High precision relates to a low false positive rate.

  • Severity 1: 0.83 - High precision, indicating that most of the predicted severity 1 cases are correct.
  • Severity 2: 0.94 - Very high precision, indicating excellent prediction accuracy for severity 2.
  • Severity 4: 0.43 - Low precision, indicating many false positives.

Recall: Recall measures the ratio of true positive observations to the total actual positives. High recall relates to a low false negative rate.

  • Severity 1: 0.48 - Low recall, indicating that many actual severity 1 cases are missed.
  • Severity 2: 0.98 - Very high recall, meaning most severity 2 cases are correctly identified.

F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall.

  • Severity 2: 0.96 - Very high F1-score, showing an excellent balance for severity 2.
  • Severity 4: 0.35 - Very low F1-score, reflecting poor performance.

Final observations

  • Severity 2 has the best performance across all metrics, indicating the model is very effective at predicting this class with high precision, recall, and F1-score.
  • Severity 4 has the poorest performance, with low precision, recall, and F1-score.